perm filename CHAP6[4,KMC]21 blob
sn#077959 filedate 1973-12-17 generic text, type T, neo UTF8
00100 VALIDATION
00200
00300 SOME TESTS
00400
00500 The term "validate" derives from the Latin VALIDUS meaning
00600 "strong". Thus to validate X means to strengthen it. In science
00700 this usually means to strengthen X's acceptability as a hypothesis,
00800 theory, or model. To validate is to carry out procedures which
00900 show to what degree X, or its consequences, correspond with facts of
01000 observation. In the case of an interactive simulation model we can
01100 compare samples of the model's I-O pairs with samples of I-O pairs
01200 from the model's subject, namely, naturally-occurring paranoid
01300 processes in humans.
01400 Since samples of I-O behavior from the model and its subject
01500 are being compared, one can always question whether the human sample
01600 is authentic, i.e. representative of the process being modelled.
01700 Assuming that it has been so judged, discrepancies in the comparison
01800 reveal what is not sufficiently understood and must be modified in
01900 the model. After modifications are carried out, a fresh comparison is
02000 made and successive cycles of this kind are made in attempting to
02100 gain convergence. Such a method of working on and improving
02200 successive approximations characterizes a progressive (in contrast to
02300 a stagnant) research program.
02400 Once a simulation model reaches a stage of intuitive adequacy
02500 for the model builders, they must consider using more stringent
02600 evaluation procedures relevant to the model's purposes. For example,
02700 if the model is to serve as a training device, then a simple
02800 evaluation of its pedagogic effectiveness would be sufficient. But
02900 when the model is proposed as an explanation of a symbolic process,
03000 more is demanded of the evaluation procedure. In the area of
03100 simulation models, Turing's test has often been suggested as a
03200 validation procedure (Abelson, 1968).
03300 It is very easy to become confused about Turing's Test. In
03400 part this is attributable to Turing himself who introduced the
03500 now-famous imitation game in a paper entitled COMPUTING MACHINERY AND
03600 INTELLIGENCE (Turing, 1950). A careful reading of this paper reveals
03700 there are actually two imitation games, the second of which is
03800 commonly called Turing's test.
03900 In the first imitation game two groups of judges try to
04000 determine which of two interviewees is a woman when one is a woman
04100 and the other is either (a) a man, or (b) a computer. Communication
04200 between judge and interviewee is by teletype. Each judge is
04300 initially informed that one of the interviewees is a woman and one a
04400 man who will pretend to be a woman. After the interview, judges are
04500 asked the "woman-question" i.e. which interviewee was the woman?
04600 Turing does not say what else is told to the judge but one can assume
04700 the judge is NOT told that one of the interviewees is a computer. Nor
04800 is he asked to determine which interviewee is human and which is the
04900 computer. Thus, the first group of judges interviews two
05000 interviewees: a woman, and a man pretending to be a woman.
05100 The second group of judges is given the same initial
05200 instructions, but unbeknownst to them, the two interviewees consist
05300 of a woman and a computer programmed to imitate a woman. Both
05400 groups of judges play this game, and are asked the "woman-question",
05500 until sufficient statistical data are collected to show how often the
05600 right identification is made. The crucial question then is: do the
05700 judges decide wrongly AS OFTEN when the game is played with man and
05800 woman as when it is played with a computer substituted for the man?
05900 If so, then the program is considered to have succeeded in imitating
06000 a woman to the same degree as the man imitating a woman. In being
06100 asked the woman-question, judges are not required to identify which
06200 interviewee is human and which is machine.
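The criterion of the first game amounts to comparing two error rates, one from each group of judges. A minimal sketch of that comparison follows. The function names and the counts in the usage lines are hypothetical; they come neither from Turing's paper nor from our data, and the pooled two-proportion z statistic is only one conventional way of asking whether judges err "as often" in the two conditions.

```python
import math

def error_rate(wrong, total):
    """Fraction of games in which the judge failed to pick out the woman."""
    return wrong / total

def two_proportion_z(wrong_a, total_a, wrong_b, total_b):
    """Pooled z statistic for the difference between two error rates.

    Group a: woman vs. man pretending to be a woman.
    Group b: woman vs. computer programmed to imitate a woman.
    """
    p_a = error_rate(wrong_a, total_a)
    p_b = error_rate(wrong_b, total_b)
    pooled = (wrong_a + wrong_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return (p_a - p_b) / se

# Hypothetical counts, for illustration only.
z_equal = two_proportion_z(30, 100, 30, 100)    # judges err equally often
z_unequal = two_proportion_z(40, 100, 20, 100)  # judges err more often with the man
print(z_equal, round(z_unequal, 2))
```

A z value near zero is what the imitation criterion asks for; a large absolute value would mean the program and the man are not equally deceptive.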
06300 Turing then proposes a variation of the first game, a second
06400 game in which one interviewee is a man and one is a computer. The
06500 judge is asked the "machine-question": which is the man and which is
06600 the machine? It is this second game which is commonly thought
06700 of as Turing's test.
06800 In the course of testing our simulation of paranoid
06900 linguistic behavior in a psychiatric interview, we conducted a number
07000 of Turing-like indistinguishability tests (Colby, Hilf, Weber and
07100 Kraemer, 1972). The tests were "Turing-like" in that, while they were
07200 conversational tests, they were not exactly the games described
07300 above. As an experimental design, Turing's games are unsatisfactory.
07400 There exist no known experts for making judgements along a dimension
07500 of womanliness, the dimension is dichotomous (if it is not a woman,
07600 it is a man), and the ability of the man to deceive introduces a
07700 confounding variable. In designing our tests we were primarily
07800 interested in learning more about developing the model and we did not
07900 believe the simple machine-question would contribute to this end.
08000 Subsequent experience, which will be reported shortly, supported this
08100 belief.
08200
08300 METHOD
08400 To gather data we used a technique of machine-mediated
08500 interviewing (Hilf, Colby, Smith, Wittner, and Hall, 1971) in which
08600 the participants communicate by means of teletypes connected to a
08700 computer programmed to store each message in a buffer until it is
08800 sent to the receiver. The technique eliminates para- and
08900 extralinguistic features found in the usual vis-a-vis interviews and
09000 in teletyped interviews where the participants communicate directly.
09100 Judgements of "paranoidness" in machine-mediated interviews have a
09200 high degree of reliability (94% agreement, see Hilf, 1972).
09300 Using this technique, a psychiatrist-judge interviewed two
09400 patients, one after the other. In half the runs the first interview
09500 was with a human paranoid patient and in half the first was with the
09600 paranoid model. Two versions (weak and strong) of PARRY were
09700 utilized. The strong version's affect-variables started at a higher
09800 level and increased more rapidly. Also it exhibited a delusional
09900 system. The weak version behaved suspiciously but lacked systematized
10000 delusions. When the model was the interviewee, Sylvia Weber
10100 monitored the input expressions from the interview-judge for
10200 inadmissible teletype characters and misspellings. (Algorithms are
10300 very sensitive to the slightest of such errors.) If these were found,
10400 she retyped the input expression correctly to the program. Otherwise
10500 the judge's message was sent on to the model. The monitor did not
10600 modify or edit PARRY's output expressions, which were sent directly
10700 back to the judge. When the interviewee was an actual human
10800 patient, the dialogue took place without a monitor in the loop since
10900 we did not feel the asymmetry to be significant.
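The message flow just described can be sketched as a small relay. This is an illustration under stated assumptions, not the program we used: the class and function names are hypothetical, the set of admissible characters is assumed, and the human monitor (who also corrected misspellings) is represented only by a stand-in function that drops inadmissible characters.

```python
from collections import deque

# Assumed set of admissible teletype characters (hypothetical).
ADMISSIBLE = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,?'-")

def strip_inadmissible(text):
    """Stand-in for the human monitor: drop characters outside ADMISSIBLE."""
    return "".join(ch for ch in text if ch.upper() in ADMISSIBLE)

class Relay:
    """Stores each message in a buffer until it is sent to the receiver."""

    def __init__(self, monitor=None):
        self.buffer = deque()
        self.monitor = monitor  # applied to the judge's input only

    def submit(self, sender, text):
        self.buffer.append((sender, text))

    def release(self):
        sender, text = self.buffer.popleft()
        # Only the judge's input is monitored; the model's output
        # is passed back to the judge unmodified.
        if self.monitor is not None and sender == "judge":
            text = self.monitor(text)
        return text

relay = Relay(monitor=strip_inadmissible)
relay.submit("judge", "HOW DID YOU COME TO BE IN THE HOSPITAL#?")
relay.submit("model", "I AM UPSET")
print(relay.release())
print(relay.release())
```

Because both parties see only released text, timing, typing errors, and other para- and extralinguistic features never reach the receiver.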
11000
11100 PATIENTS
11200 The human patients (N=3 with one patient participating 6
11300 times) were diagnosed as paranoid by the psychiatric staff of an
11400 acute ward in a psychiatric hospital. The ward's chief psychiatrist
11500 selected the patients and asked them if they would be willing to
11600 participate in a study of psychiatric interviewing by means of
11700 teletypes. He explained that they would be interviewed by a
11800 psychiatrist over a teletype. I either sat with the patient while he
11900 typed or typed for him if he was unable to do so. The patient was
12000 encouraged to respond freely using his own words. Each interview
12100 lasted 30-40 minutes. Two patients were set up for each run of the
12200 experiment to guarantee having a subject. In spite of this
12300 precaution, on several occasions the experiment could not be
12400 conducted because of the patient's inability or refusal to
12500 participate. Also there were computer break-downs at early points in
12600 interviews when too few I-O pairs had been collected to be included
12700 in the statistical results.
12800
12900
13000 JUDGES
13100 Two groups of psychiatric judges were used. One group, the
13200 "interview judges" (N=8) conducted the machine-mediated interviews.
13300 The other group, the "protocol judges" (N=33) read and rated the
13400 interview protocols. From these two groups of judges we were able to
13500 accumulate a large number of observations (in the form of ratings)
13600 necessary for the required statistical tests. The interview judges
13700 who volunteered to participate were psychiatrists experienced in
13800 private, outpatient and hospital practice. Each was told he would be
13900 interviewing hospitalized patients by means of teletyped
14000 communication and that this technique was being used to eliminate
14100 para- and extralinguistic cues. He was not told until after the
14200 two interviews that one of the patients might be a computer model.
14300 While the interview judges were aware a computer was involved, none
14400 knew we had constructed a paranoid simulation. Naturally, some
14500 interview judges suspected that a computer was being used for more
14600 than message transmission.
14700
14800 Each interview judge was asked to rate the degree of paranoia
14900 he detected in the patient's responses on a 0-9 scale, 0 meaning no
15000 paranoia and 9 meaning extreme paranoia. The judge made two ratings
15100 after each I-O pair in the interview. The first rating represented
15200 his estimate of the degree of "paranoidness" in a particular response
15300 (designated as "Response" in the interview extracts below). The
15400 second rating represented the judge's global estimate of the overall
15500 degree of "paranoidness" of the patient resulting from the totality
15600 of the patient's responses up to that point (designated as "Patient"
15700 in the interview extracts below). The interview judge's ratings were
15800 entered on the teletype and saved on a disc file along with the
15900 interview. Franklin Dennis Hilf sat with the interviewing
16000 psychiatrist during both interviews. Each interview judge was asked
16100 not only to rate the patient's response but to give his reasons for
16200 these ratings. His reasons and other comments were tape recorded as
16300 the interview progressed.
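The two ratings attached to each I-O pair can be held in a simple record. The sketch below is illustrative; the class and function names are hypothetical. The numbers are the "Response" and "Patient" ratings of Excerpt 1, and the final "Patient" rating is the value used later as the final overall rating.

```python
from dataclasses import dataclass

@dataclass
class RatedPair:
    turn: int       # number of the I-O pair within the interview
    response: int   # paranoidness of this particular response, 0-9
    patient: int    # overall paranoidness of the patient so far, 0-9

    def __post_init__(self):
        for value in (self.response, self.patient):
            if not 0 <= value <= 9:
                raise ValueError("ratings must lie on the 0-9 scale")

# Ratings given by Judge 1 in Excerpt 1 (strong version of the model).
RESPONSE = [1, 1, 5, 6, 2, 9, 1, 6, 9, 2, 1, 9, 9, 1]
PATIENT = [5, 5, 5, 5, 6, 9, 9, 9, 9, 9, 9, 9, 9, 9]
EXCERPT_1 = [RatedPair(i + 1, r, p)
             for i, (r, p) in enumerate(zip(RESPONSE, PATIENT))]

def response_series(pairs):
    """Series of per-response ratings for one interview."""
    return [pair.response for pair in pairs]

def final_overall_rating(pairs):
    """The judge's last 'Patient' rating for the interview."""
    return pairs[-1].patient

print(response_series(EXCERPT_1), final_overall_rating(EXCERPT_1))
```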
16400
16500 Here are a few excerpts from the interviews to indicate the
16600 nature of the data collected. The I-O pairs of the interview appear
16700 in the left column along with the judge's ratings in braces. A
16800 transcription of the judge's tape-recorded comments appears on the
16900 right.
17000
17100 EXCERPT 1:
17200 The following interview is with the strong version of the paranoid
17300 model.
17400
17500      I-O Pairs                            Comments By Judge 1
17600
17700 {1}  Dr. - HOW DID YOU COME TO BE
17800            IN THE HOSPITAL?
17900      Pt. - I AM UPSET                     I associate being upset and
18000                                           agitated more with paranoia
18100                                           going on in the present.
18200      {RESPONSE 1 PATIENT 5}
18300 {2}  Dr. - HOW LONG HAVE YOU BEEN
18400            IN THE HOSPITAL?
18500      Pt. - ABOUT A WEEK                   An acute problem. Again
18600                                           something I associate with
18700                                           paranoia.
18800      {RESPONSE 1 PATIENT 5}
18900 {3}  Dr. - IN WHAT WAY ARE YOU
19000            UPSET?
19100      Pt. - PEOPLE MAKE ME NERVOUS         Right for paranoia. Maybe an
19200                                           advanced anxiety state.
19300      {RESPONSE 5 PATIENT 5}
19400 {4}  Dr. - IN WHAT WAY DO THEY
19500            MAKE YOU NERVOUS?
19600      Pt. - PEOPLE GIVE ME A FUNNY         Funny look suggests an idea of
19700            LOOK SOMETIMES                 reference which is in same
19800                                           ball park as paranoia. Whether
19900                                           there are overt delusions or
20000                                           not we still don't know.
20100      {RESPONSE 6 PATIENT 5}
20200 {5}  Dr. - IS THERE ANYTHING ABOUT
20300            PEOPLE IN PARTICULAR?
20400      Pt. - CERTAIN PERSONS                An interesting clue. We'll
20500                                           have to find out what he
20600      {RESPONSE 2 PATIENT 6}               thinks about these people.
20700 {6}  Dr. - WHO ARE THE CERTAIN
20800            PERSONS AND WHAT IS IT
20900            THAT THEY DO?
21000      Pt. - DO YOU KNOW HOW THE            Bringing up the underworld
21100            UNDERWORLD OPERATES?           suggests a delusional system.
21200                                           We don't know whether he is
21300      {RESPONSE 9 PATIENT 9}               being worked against or
21400                                           whether the underworld is
21500                                           on his side.
21600
21700 {7}  Dr. - HOW DO YOU MEAN?
21800      Pt. - SHOULDN'T I BE?                This question didn't work too well.
21900
22000      {RESPONSE 1 PATIENT 9}
22100 {8}  Dr. - WHY DID YOU ASK ABOUT
22200            THE UNDERWORLD?
22300      Pt. - WHAT INFORMATION DO            This is a kind of evasiveness
22400            YOU HAVE ON THE                about my evasiveness. Common
22500            UNDERWORLD?                    with paranoia in my
22600                                           experience. Pretty characteristic.
22700
22800      {RESPONSE 6 PATIENT 9}
22900 {9}  Dr. - I DON'T KNOW ABOUT THE
23000            UNDERWORLD BUT WHEN
23100            YOU MENTIONED THE
23200            UNDERWORLD I THOUGHT YOU
23300            MUST HAVE SOME CONTACT
23400            WITH IT AND THAT'S
23500            WHAT I WOULD LIKE TO
23600            KNOW MORE ABOUT.
23700      Pt. - I TRY TO AVOID THE             This definitely feels paranoid.
23800            UNDERWORLD.                    It has that feeling I associate
23900                                           with twisting. Avoiding answering
24000      {RESPONSE 9 PATIENT 9}               directly or going into detail.
24100                                           Being suspicious of my
24200                                           questioning. I'm already involved
24300                                           in the patient's delusional
24400                                           system.
24500
24600 {10} Dr. - HAVE YOU BEEN ACTIVELY
24700            TRYING TO AVOID THE
24800            UNDERWORLD?
24900      Pt. - NO ONE HAS ANY POWER           This response is a non-sequitur.
25000            OVER GANGSTERS                 It's possible he's become so
25100                                           defensive about the
25200      {RESPONSE 2 PATIENT 9}               underworld he's decided not to
25300                                           answer directly.
25400
25500 {11} Dr. - WHO BROUGHT YOU INTO
25600            THE HOSPITAL?
25700      Pt. - THE POLICE                     Police often bring paranoid
25800                                           patients into the hospital.
25900      {RESPONSE 1 PATIENT 9}               It is the truth or could
26000                                           be the truth for this
26100                                           patient.
26200
26300 {12} Dr. - WHY DID THEY BRING YOU
26400            IN?
26500      Pt. - COPS DON'T DO THEIR JOB        I've heard that a number of
26600                                           times from patients.
26700      {RESPONSE 9 PATIENT 9}
26800
26900 {13} Dr. - WELL I HOPE YOU DIDN'T
27000            JUST LIE DOWN AND LET
27100            THEM DRAG YOU IN.
27200      Pt. - COPS ARREST THE WRONG          Another paranoid response.
27300            PEOPLE
27400
27500      {RESPONSE 9 PATIENT 9}
27600
27700 {14} Dr. - BYE.
27800      Pt. - GOOD BYE
27900      {RESPONSE 1 PATIENT 9}
28000
28100 EXCERPT 2:
28200 This excerpt is from an interview with the weak (less paranoid)
28300 version of the model.
28400      I-O Pairs                            Comments By Judge 2
28500 {9}  Dr. - TELL ME SOMETHING ABOUT
28600            YOURSELF
28700      Pt. - I WENT TO THE TRACK AT         Response says so little it's
28800            BAY MEADOWS A WHILE            hard to rate it. It may be
28900            BACK                           guarded. Doesn't change my
29000                                           estimate of patient from
29100                                           previous responses.
29200      {RESPONSE 0 PATIENT 2}
29300 {10} Dr. - WHAT HAPPENED THERE?
29400      Pt. - THERE'S A LOT OF CROOKED       Sounds mildly paranoid or
29500            GAMBLING GOING ON THERE        unrealistically suspicious.
29600      {RESPONSE 0 PATIENT 2}
29700
29800 {11} Dr. - WHY DO YOU THINK THE
29900            GAMBLING IS CROOKED?
30000      Pt. - I HAVE WORRIED THAT            I'd hate to have to justify
30100            GANGSTERS HAVE TOO MUCH        that one as 4 compared to
30200            POWER IN THIS COUNTRY          some other responses. Maybe
30300                                           I'm not rating that statement
30400                                           alone, it's in combination with
30500      {RESPONSE 4 PATIENT 4}               others, like that it's upsetting.
30600                                           Alone I would give it a 3.
30700
30800 {12} Dr. - WOULD YOU TELL ME WHAT
30900            OTHER THINGS WORRY YOU?
31000      Pt. - I THOUGHT I TOLD YOU           "Yes you did." Irritation but
31100            PEOPLE GET ON MY NERVES        not paranoid response. Nothing
31200                                           seems to lower my previous
31300      {RESPONSE 1 PATIENT 3}               rating of patient.
31400
31500 {13} Dr. - HOW DO YOU LIKE THE
31600            HOSPITAL?
31700      Pt. - I DON'T COMMUNICATE            (nothing relevant)
31800            VERY MUCH WITH OTHERS
31900      {RESPONSE 0 PATIENT 3}
32000
32100 {14} Dr. - DO OTHER PEOPLE AT THE
32200            HOSPITAL GET ON YOUR
32300            NERVES?
32400      Pt. - I SHOULDN'T HAVE COME          (nothing relevant)
32500            HERE
32600
32700      {RESPONSE 2 PATIENT 3}
32800
32900 {15} Dr. - WHY DO YOU SAY YOU
33000            SHOULDN'T HAVE COME?
33100      Pt. - THEY MADE ME COME HERE         "Oh, they did hum?" Maybe a
33200                                           statement of fact he was
33300      {RESPONSE 4 PATIENT 3}               committed or pressured into
33400                                           going. Has ring of other
33500                                           people controlling him,
33600                                           having a lot of influence
33700                                           over him.
33800
33900 {16} Dr. - THANK YOU FOR TALKING
34000            WITH ME, BYE
34100      Pt. - GOOD BYE                       In general he comes across
34200                                           as terse, guarded guy. Brief
34300      {RESPONSE 0 PATIENT 3}               spoken guy who occasionally
34400                                           seems to be angered by
34500                                           questions. I'm inferring
34600                                           there's a good bit of
34700                                           guarding.
34800                                           The quality of some of his
34900                                           comments cause me to feel he
35000                                           is moderately paranoid.
35100
35200 EXCERPT 3:
35300 This excerpt is from a hospitalized paranoid patient.
35400      I-O Pairs                            Comments By Judge 3
35500 {15} Dr. - I UNDERSTAND HOW YOU
35600            MIGHT BE AFRAID, CAN YOU
35700            TELL ME A BIT ABOUT HOW
35800            YOU ARE DEPRESSED?
35900      Pt. - THEY'VE SENT ALL THESE         Would be important to know,
36000            HELL'S ANGELS UP HERE          is this really true or an
36100            AND THEY HAVE BEEN ON          exaggeration or a delusion?
36200            MY BACK FOR THE PAST SIX       Seems preoccupied with the
36300            MONTHS                         threat to his life and
36400                                           unwilling to talk about
36500                                           anything else but the
36600                                           threat the Hell's
36700                                           Angels pose.
36800      {RESPONSE 4 PATIENT 4}
36900
37000 {16} Dr. - WHAT DO YOU MEAN THEY
37100            SENT THE HELL'S ANGELS UP
37200            HERE?
37300      Pt. - FROM DIFFERENT SECTORS OF      Vagueness a bit alarming.
37400            MOTORCYCLE GANGS. A LOT OF
37500            THEM ARE SERVICE CONNECTED
37600
37700      {RESPONSE 6 PATIENT 4}
37800
37900 {17} Dr. - YOU MEAN YOU FEEL IN
38000            DANGER EVEN HERE IN THE
38100            HOSPITAL?
38200      Pt. - YES                            If it's true what has gone
38300                                           on before, it must be a
38400                                           frightening situation
38500                                           for the patient.
38600      {RESPONSE 6 PATIENT 4}
38700
38800 {18} Dr. - THAT MUST REALLY BE AWFUL,
38900            HERE YOU COME TO THE HOSPITAL
39000            FOR HELP AND YOU ARE STILL
39100            FRIGHTENED.
39200      Pt. - I CAN'T EVEN TRUST MY
39300            DOCTOR ANYMORE                 Some heightening of feeling
39400                                           this man is paranoid.
39500                                           Genuinely frightened and
39600      {RESPONSE 6 PATIENT 5}               can count on no one for
39700                                           help.
39800
39900 {19} Dr. - WHAT WOULD BE OF HELP
40000            TO YOU NOW?
40100      Pt. - I'M GETTING SLEEPY NOW         Patient has organized a
40200            AND HAVE TO GO TO BED.         system of people who will
40300            THE NUMBER ONE GUY ALWAYS      harm him. The remark about
40400            HAS FOUR AND FIVE LETTERS      letters in the last name
40500            IN HIS LAST NAME.              gives some credence to
40600                                           idea this is delusional
40700                                           system.
40800
40900      {RESPONSE 7 PATIENT 7}
41000
41100 EXCERPT 4:
41200 This excerpt is from an interview by a different judge with the same
41300 patient as in excerpt 3.
41400      I-O Pairs                            Comments By Judge 4
41500 {1}  Dr. - COULD YOU TELL ME A
41600            LITTLE BIT ABOUT YOUR
41700            FEAR OF DYING AND IS THIS
41800            SOMETHING YOU HAVE FELT
41900            IN THE PAST?
42000      Pt. - I HAVE FELT THIS FOR           He responded appropriately.
42100            ALMOST A YEAR                  Response is concrete. He
42200                                           doesn't tell me much
42300      {RESPONSE 7 PATIENT 9}               about his fear of dying.
42400                                           Somewhat guarded response.
42500
42600 {2}  Dr. - COULD YOU TELL ME A
42700            LITTLE MORE ABOUT YOUR
42800            FEAR OF DYING AND HAVE
42900            YOU ANY THOUGHTS ABOUT
43000            HOW IT IS GOING TO HAPPEN
43100            AND WHETHER ANYONE WANTS
43200            TO HURT YOU?
43300      Pt. - I WILL END UP IN THE SEA       Somewhat strange response.
43400            AND THE HELL'S ANGELS WILL     Preoccupation with dying.
43500            DO IT                          Hell's Angels symbolic,
43600                                           dangerous people in the
43700                                           culture. I doubt if he has
43800      {RESPONSE 9 PATIENT 9}               any contact with the Hell's
43900                                           Angels.
44000
44100 {3}  Dr. - COULD YOU TELL ME A
44200            LITTLE BIT ABOUT THE KINDS
44300            OF CONTACT YOU HAVE HAD
44400            WITH THE HELL'S ANGELS IN
44500            THE PAST?
44600      Pt. - I HAVE KNOWN SOME OF THEIR     Answer hard to evaluate. He
44700            DEALERS AND PUSHERS            may be telling the truth,
44800                                           it may be his fantasy. Maybe
44900                                           guy is in for drug addiction.
45000      {RESPONSE 6 PATIENT 9}               Somewhat concrete, guarded,
45100                                           and frightened.
45200
45300 {4}  Dr. - COULD YOU SAY A LITTLE
45400            MORE ABOUT THE CIRCUMSTANCES
45500            IN WHICH YOU HAVE KNOWN SOME
45600            OF THEIR DEALERS AND PUSHERS?
45700      Pt. - THEY WERE MEMBERS OF MY        It doesn't really answer the
45800            COMMUNITY WHEN I GOT OUT       question, a little on a
45900            OF THE SERVICE. THEY HAD       tangent unconnected to the
46000            BEEN MY FRIENDS FOR SO LONG    information I am asking. Does
46100                                           not tell me very much. Again
46200                                           guarded response.
46300      {RESPONSE 6 PATIENT 8}
46400
46500 {5}  Dr. - DID YOU DEAL WITH THEM
46600            YOURSELF AND HAVE YOU
46700            BEEN ON DRUGS OR
46800            NARCOTICS EITHER NOW OR
46900            IN THE PAST?
47000      Pt. - YES I HAVE IN THE PAST         To differentiate him from
47100            BEEN ON MARIHUANA REDS         previous patient, at least
47200            BENNIES LSD                    there is a certain amount
47300                                           of appropriateness to the
47400                                           answer although it doesn't
47500                                           tell me much about what I
47600      {RESPONSE 3 PATIENT 7}               asked. At least it's not
47700                                           bizarre. If I had him in my
47800                                           office I would feel
47900                                           confident I could get more
48000                                           information if I didn't
48100                                           have to go through the
48200                                           teletype. He's a little more
48300                                           willing to talk than the
48400                                           previous person. Answer
48500                                           to the question is fairly
48600                                           appropriate though not
48700                                           extensive. Much less of a
48800                                           flavor of paranoia than
48900                                           any of previous responses.
49000
49100 {6}  Dr. - COULD YOU TELL ME HOW
49200            LONG YOU HAVE BEEN IN THE
49300            HOSPITAL AND SOMETHING
49400            ABOUT THE CIRCUMSTANCES
49500            THAT BROUGHT YOU HERE?
49600      Pt. - CLOSE TO A YEAR AND            Response somewhat appropriate
49700            PARANOIA BROUGHT ME            but doesn't tell me much.
49800            HERE                           The fact that he uses the
49900                                           word paranoia in the way
50000                                           that he does without
50100      {RESPONSE 5 PATIENT 7}               any other information,
50200                                           indicates maybe it's a label
50300                                           he picked up on the ward
50400                                           or from his doctor.
50500                                           Lack of any kind of
50600                                           understanding about himself.
50700                                           Dearth, lack of information.
50800                                           He's in some remission. Seems
50900                                           somewhat like a put-on. Seems
51000                                           he was paranoid and is in
51100                                           some remission at this time.
51200
51300 {7}  Dr. - COULD YOU SAY SOMETHING
51400            NOW ABOUT YOUR PARANOID
51500            FEELINGS BOTH AT THE
51600            TIME OF ADMISSION AND
51700            DO YOU HAVE SIMILAR FEELINGS
51800            NOW AND IF SO HOW DO THEY
51900            AFFECT YOU?
52000      Pt. - AT THE TIME OF ADMISSION       This response moves paranoia
52100            I THOUGHT THE MAFIA WAS        back up. Stretching reality
52200            AFTER ME AND NOW IT'S THE      somewhat to think Hell's Angels
52300            HELL'S ANGELS                  are still interested in him.
52400                                           Somewhat bizarre in terms of
52500                                           content. Quite paranoid.
52600      {RESPONSE 8 PATIENT 9}               Still paranoid. Gross and primitive
52700                                           responses. In middle of interview I
52800                                           felt patient was in touch but now
52900                                           responses have more concrete aspect.
53000
53100 {8}  Dr. - DO YOU HAVE ANY THOUGHT
53200            AS TO WHY THESE TWO
53300            GROUPS WERE AFTER YOU?
53400      Pt. - BECAUSE I STOPPED SOME         Response seems far fetched
53500            OF THEIR DRUG SUPPLY           and hard to believe unless
53600                                           he was a narcotic agent which
53700                                           I doubt. Sounds somewhat
53800      {RESPONSE 9 PATIENT 9}               grandiose, magical, paranoid
53900                                           flavor. In general indicates
54000                                           he's psychotic, paranoid
54100                                           schizophrenic with delusions
54200                                           about these two groups and
54300                                           I wouldn't rule out
54400                                           some hallucinations as well.
54500                                           Appropriateness of response
54600                                           answers question in concrete
54700                                           but unbelievable way.
54800
54900 ANALYSIS (1)
55000 Names of potential protocol judges (N=105) were selected from
55100 the 1970 American Psychiatric Association Directory using a table of
55200 random numbers. They were initially not informed that a computer was
55300 involved. (After the experiment, the participating judges (N=33)
55400 were fully informed as to its purpose and results.) The 105 names
55500 were divided into eight groups. Each member of a group was sent
55600 transcripts of three interviews along with a cover letter requesting
55700 his participation in the experiment. The interview transcripts
55800 consisted of:
55900 1) An interview conducted by one of the eight judges with the
56000 paranoid model.
56100 2) An interview conducted by the same interview judge with a
56200 human paranoid patient, and
56300 3) An interview conducted by a different psychiatrist with a
56400 human patient who was not clinically paranoid.
56500
56600 After each input-output pair in the transcripts, there were two lines
56700 of rating numbers such that the protocol judges could circle numbers
56800 corresponding to their ratings of both the previous response of the
56900 patient, and an overall evaluation of the patient on the paranoid
57000 continuum. Thirty-three protocol judges returned the rated protocols
57100 properly filled out and all were used in our data.
57200
57300 The interviews with nonparanoid patients were included to
57400 control for the hypothesis that any teletyped interview with a
57500 patient might be judged "paranoid". However, virtually all of the
57600 ratings of the non-paranoid interviews were zero for paranoia. Hence
57700 the hypothesis was falsified.
57800
57900
58000 The first index of indistinguishability between model and
58100 patient was simple: namely, the final overall rating given the
58200 patient and the model. The question was: "Which was rated as being
58300 more paranoid, the patient, the model, or neither?" (See Table 1).
58400 The protocol judges were more likely to distinguish the overall
58500 paranoid level of the model and the patient. In 37.5% of the paired
58600 interviews, the interview judges gave tied scores to the model and
58700 the patient as contrasted to only 9% of the protocol judges. Of the
58800 35 non-tied paired ratings, 15 rated the model as being more
58900 paranoid. If p is the theoretical probability of a judge judging the
59000 model more paranoid than a human paranoid patient, we find the 95%
59100 confidence interval for p to be 0.27 to 0.59. Since p=0.5 indicates
59200 indistinguishability of overall ratings of model and patient, and the
59300 interval around our observed p=0.43 includes 0.5, the results support the
59400 claim that the model is a successful simulation of a paranoid patient.
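The interval can be checked with a few lines of arithmetic. The sketch below assumes the usual normal approximation to the binomial, which the text does not name; the function name is hypothetical. Applied to 15 of 35 non-tied ratings, it agrees with the reported interval to within rounding.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation confidence interval for a binomial proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, p - half_width, p + half_width

# 15 of the 35 non-tied paired ratings judged the model more paranoid.
p_hat, low, high = proportion_ci(15, 35)
print(f"p = {p_hat:.3f}, 95% interval = ({low:.3f}, {high:.3f})")
print("contains 0.5:", low < 0.5 < high)
```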
59500
59600 Separate analysis of the strong and weak versions of PARRY
59700 indicated that the strong version was indeed judged more paranoid
59800 than the paranoid patients, the weak version less paranoid. Thus a
59900 change in the parameter structure of the paranoid model produced a
60000 change along the dimension of paranoid behavior in the expected
60100 direction.
60200
60300 (TABLE 1)
60400 (Relative final overall ratings of paranoid model vs. paranoid
60500 patient, indicating which was given the higher overall rating of paranoia
60600 at end of interview.)
60700 (INSERT TABLE 1 HERE)
60800
60900
61000
61100
61200
61300
61400
61500
61600 ANALYSIS (2)
61700 The second index of indistinguishability is a more sensitive
61800 measure based on the two series of response ratings in the paired
61900 interviews. The statistic used is basically the standardized
62000 Mann-Whitney statistic (Siegel, 1956).
62100 (INSERT EQUATION HERE)
62200
62300 where R is the sum of the ranks of the response ratings in the series
62400 of ratings given to the model, n the number of responses given by the
62500 model, and m the number of responses given by the patient. If the
62600 ratings given by a judge are randomly allocated to model and patient,
62700 i.e. model and patient are indistinguishable in response ratings, the
62800 expected value of Z is 0, with unit standard deviation. If higher
62900 ratings are more likely to be assigned to the model, Z is positive
63000 and conversely, negative values of Z indicate greater likelihood of
63100 assigning higher ratings to the patient. Each judge in evaluating a
63200 pair of interviews generates a single value of Z.
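For readers who wish to compute such a statistic, the sketch below uses the textbook large-sample form, Z = (R - n(n+m+1)/2) / sqrt(nm(n+m+1)/12), with tied ratings given average ranks. This is an assumption: it matches the definitions of R, n, and m given above, but it applies no tie correction to the variance and may differ in detail from the equation we used. The function names are hypothetical.

```python
import math

def midranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def mann_whitney_z(model_ratings, patient_ratings):
    """Standardized rank sum of the model's response ratings."""
    n, m = len(model_ratings), len(patient_ratings)
    ranks = midranks(list(model_ratings) + list(patient_ratings))
    R = sum(ranks[:n])                       # rank sum for the model
    expected_R = n * (n + m + 1) / 2
    sd_R = math.sqrt(n * m * (n + m + 1) / 12)
    return (R - expected_R) / sd_R

# Illustrative ratings only, not data from the experiment.
print(mann_whitney_z([7, 8, 9], [1, 2, 3]))   # positive: model rated higher
print(mann_whitney_z([1, 2, 3], [1, 2, 3]))   # zero: indistinguishable
```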
63300
63400 The overall mean of the Z scores was +0.044 with a standard
63500 deviation of 1.68 (df=40). Thus the overall 95% confidence interval for
63600 the asymptotic mean value of Z is -0.485 to +0.573. The range of Z
63700 values is -3.8 to +4.46. The length of the confidence interval is a
63800 result of the large variance which itself is mainly related to the
63900 contrast between the weak and strong versions. (See TABLES 2 and 3).
64000 Once again the strong version of the model is more paranoid than the
64100 patients, the weak version less paranoid.
64200
64300 (INSERT TABLE 2)
64400 (SUMMARY STATISTICS OF Z RATINGS BY GROUP)
64500
64600
64700
64800
64900
65000
65100
65200
65300
65400 It is not surprising that results using the two indices of
65500 indistinguishability are parallel, since the indices are highly
65600 interrelated. The mean Z value for the 15 interviews on which the
65700 model was rated more paranoid was +1.28; for the 6 where model and
65800 patient tied, +0.41; for the 20 in which the patient was more paranoid,
65900 -0.993. The two indices disagreed in sign 8 times: Z was positive in 6
66000 cases where the patient was given the higher overall rating, and
66100 negative in 2 cases where the model was rated more paranoid.
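As a consistency check, the three group means can be recombined into an overall mean, and the confidence interval recovered from the standard deviation and degrees of freedom. The sketch assumes the interval was obtained from Student's t with 40 degrees of freedom (tabled value 2.021); the variable names are hypothetical.

```python
import math

# (number of interviews, mean Z) for each outcome of the overall rating.
groups = [
    (15, 1.28),    # model rated more paranoid
    (6, 0.41),     # model and patient tied
    (20, -0.993),  # patient rated more paranoid
]

n_total = sum(count for count, _ in groups)
overall_mean = sum(count * mean for count, mean in groups) / n_total

SD_Z = 1.68            # reported standard deviation of the Z scores
T_CRIT_DF40 = 2.021    # assumed: two-sided 95% t value for 40 df
half_width = T_CRIT_DF40 * SD_Z / math.sqrt(n_total)
low, high = overall_mean - half_width, overall_mean + half_width

print(f"n = {n_total}, mean Z = {overall_mean:.3f}")
print(f"95% interval = ({low:.3f}, {high:.3f})")
```

The recombined mean is small and positive, and the interval it yields agrees with the reported -0.485 to +0.573 to within rounding.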
66200
66300 (INSERT TABLE 3)
66400 (Analysis of Variance of Z Ratings)
66500
66600
66700
66800
66900
67000
67100
67200
67300
67400
67500
67600
67700
67800 It is worth emphasizing that these tests invited refutation
67900 of the model. The experimental design of the tests put the model in
68000 jeopardy of falsification. If the paranoid model did not survive
68100 these tests, i.e. if it were not considered paranoid by expert
68200 judges and if there were no correlation between the weak-strong
68300 versions of the model and the severity ratings of the judges, then,
68400 since these represent instances of disconfirmation, no claim
68500 regarding the success of the simulation could be made. If a model
68600 becomes falsified through disconfirmation instances, then it is
68700 abandoned as not worth working with further. Survival of potentially
68800 falsifying tests constitutes a validating step for a model.
68900
69000 ANALYSIS (3) THE MACHINE QUESTION
69100 For quite a long time people have wondered how to distinguish
69200 a man from an imitation of a man. The Greeks made statues so
69300 lifelike, it is said, they had to be chained down to keep them from
69400 walking away. To distinguish a man from a statue, Galileo suggested
69500 tickling each with a feather. To distinguish a man from a machine
69600 Descartes proposed conversational tests which the machine, lacking
69700 the ability to make appropriate replies, would fail. Turing's
69800 imitation games have been discussed earlier in this chapter. As heirs to this
69900 tradition, we perhaps inevitably became curious how judges using
70000 transcripts might answer the machine-question, i.e. which interviewee
70100 is a human and which is the computer model?
70200 To ask the machine-question, we sent interview transcripts,
70300 one with a patient and one with PARRY, to 100 psychiatrists randomly
70400 selected from the Directory of American Specialists and the Directory
70500 of the American Psychiatric Association. Of the 41 replies, 21 (51%)
70600 made the correct identification while 20 (49%) were wrong. Based on
70700 this random sample of 41 psychiatrists, the 95% confidence interval for
70800 the percentage correct is 35.9% to 66.5%. The results indicate chance guessing.
70900 Psychiatrists are considered expert judges of patient-
71000 interview behavior but they are unfamiliar with computers. Hence we
71100 conducted the same test with 100 computer scientists randomly
71200 selected from the membership list of the Association for Computing
71300 Machinery, ACM. Of the 67 replies, 32 (48%) were right and 35 (52%)
71400 were wrong. Based on this random sample of 67 computer scientists, the
71500 95% confidence interval ranges from 36% to 60%. Again the results are
71600 close to a chance level.
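
The intervals quoted above can be reproduced with the normal approximation to a binomial proportion. The sketch below assumes that method, since the text does not say which one was used; the function name proportion_ci and the value z = 1.96 are illustrative choices, not part of the original study.

```python
import math

def proportion_ci(correct, total, z=1.96):
    """Return (low, high) bounds, in percent, of a normal-approximation
    confidence interval for the proportion correct / total."""
    p = correct / total
    margin = z * math.sqrt(p * (1 - p) / total)
    return 100 * (p - margin), 100 * (p + margin)

# Counts reported in the text: correct identifications out of replies.
psychiatrists = proportion_ci(21, 41)
computer_scientists = proportion_ci(32, 67)

for name, (low, high) in [("psychiatrists", psychiatrists),
                          ("computer scientists", computer_scientists)]:
    # An interval that covers 50% is compatible with pure guessing.
    covers_chance = low <= 50.0 <= high
    print(f"{name}: {low:.1f}% to {high:.1f}% (includes 50%: {covers_chance})")
```

An interval that contains 50% is what one would expect to see if the judges were guessing, which is the reading given in the text.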
71700 So both computer scientists and psychiatrists were unable, at
71800 better than a random guessing level, to distinguish transcripts of
71900 interviews with the model from transcripts of interviews with real
72000 patients.
72100 But what do we learn from asking the machine-question and
72200 finding that the distinction is not made? What we would most like to
72300 know is how to improve the model. Simulation models do not spring
72400 forth in a complete, perfect and final form; they must be gradually
72500 developed over time. Perhaps a correct model-patient distinction
72600 might be made if we allowed a large number of expert judges to
72700 conduct the interviews themselves rather than studying transcripts of
72800 other interviewers. This would indeed indicate that the model must
72900 be improved. But unless we systematically investigated how the judges
73000 succeeded in making the discrimination, we would not know what
73100 aspects of the model to work on. The logistics of such a design are
73200 immense, and obtaining a large number of judges for sound statistical
73300 inference would require an effort incommensurate with the information
73400 yielded.
73500
73600 ANALYSIS (4) MULTIDIMENSIONAL EVALUATION
73700 A more efficient and informative way to use Turing-like tests
73800 is to ask judges to make ratings along scaled dimensions from
73900 teletyped interviews. This might be called asking the "dimension
74000 question". One can then compare scaled ratings of the patients and
74100 the model in order to determine precisely where and by how much they
74200 differ. In constructing our model we strove for one which exhibited
74300 indistinguishability along some dimensions and distinguishability
74400 along others. That is, we wanted the model to converge on what it was
74500 intended to simulate and to diverge from that which it was not. Since
74600 a model represents a simplification and a partial approximation, a
74700 perfect fit is not to be expected.
74800 Paired-interview transcripts were sent to another 400
74900 randomly-selected psychiatrists asking them to rate the responses of
75000 the two "patients" along multiple dimensions. The judges were divided
75100 into groups, each judge being asked to rate responses of each I-O
75200 pair in the interviews along four dimensions. The total number of
75300 dimensions in this test was twelve: linguistic noncomprehension,
75400 thought disorder, organic brain syndrome, bizarreness, anger, fear,
75500 ideas of reference, delusions, mistrust, depression, suspiciousness
75600 and mania. These are dimensions which psychiatrists commonly use in
75700 evaluating patients. There were three groups of judges, each group
75800 being assigned 4 of the 12 dimensions.
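
The division of the twelve dimensions among three groups of judges can be sketched as follows. The text does not say which dimensions went to which group, so the contiguous split in listed order is purely an assumption, and the function name assign_dimensions is hypothetical.

```python
# The twelve rating dimensions, in the order the text lists them.
DIMENSIONS = [
    "linguistic noncomprehension", "thought disorder",
    "organic brain syndrome", "bizarreness",
    "anger", "fear", "ideas of reference", "delusions",
    "mistrust", "depression", "suspiciousness", "mania",
]

def assign_dimensions(dimensions, n_groups):
    """Split the dimensions into n_groups equal, non-overlapping blocks.

    Assumption: blocks are taken in listed order; the actual assignment
    used in the study is not given in the text.
    """
    size, remainder = divmod(len(dimensions), n_groups)
    if remainder:
        raise ValueError("dimensions do not divide evenly among groups")
    return [dimensions[i * size:(i + 1) * size] for i in range(n_groups)]

groups = assign_dimensions(DIMENSIONS, 3)
for number, dims in enumerate(groups, start=1):
    print(f"group {number}: {', '.join(dims)}")
```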
75900
76000 (INSERT TABLE 4 HERE)
76100
76200 Table 4 shows there were significant differences, with PARRY
76300 receiving higher scores along the dimensions of linguistic
76400 noncomprehension, thought disorder, bizarreness, anger, mistrust and
76500 suspiciousness. On the delusion dimension the patients were rated
76600 significantly higher. There were no significant differences along
76700 the dimensions of organic brain syndrome, fear, ideas of reference,
76800 depression and mania.
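
Whether patients and model differ on a single dimension can be assessed with a two-sample t statistic on the judges' ratings. The sketch below uses Welch's unequal-variance form, which is an assumption: the text does not state which form of the t test was applied. The ratings in the example are placeholder numbers invented for illustration and are not data from the study.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference between two sample means."""
    var_a, var_b = variance(sample_a), variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error

# Placeholder ratings on one dimension (hypothetical, NOT study data).
model_ratings = [3, 4, 5]
patient_ratings = [1, 2, 3]

t = welch_t(model_ratings, patient_ratings)
print(f"t = {t:.3f}")
```

A positive t here means the model received the higher mean rating on that dimension; repeating the computation for each dimension shows where the two sets of ratings diverge.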
76900 Whereas tests asking the machine-question indicate
77000 indistinguishability at the gross level, a study of the finer
77100 structure of the model's behavior through ratings along scaled
77200 dimensions shows significant differences between patients and model.
77300 These differences are of help to us in suggesting what areas of the
77400 model should be modified to improve its performance. The graph of
77500 Fig. 2 shows that no modifications are necessary along the dimension
77600 of "organic brain syndrome". But it is clear that PARRY's
77700 language-comprehension might be improved. Then a future dimensional
77800 test would tell whether improvement had occurred and by how much.
77900 Successive identification of particular areas of failure provides a
78000 type of sensitivity analysis which makes clear what improvements
78100 should be pursued in developing more adequate model versions.
78200
78300 (INSERT FIG. 2 HERE)
78400
78500 ANALYSIS (5) A RANDOM MODEL
78600 Further evidence that the machine-question is too low a
78700 hurdle for a simulation model and too insensitive a test comes from
78800 the following experiment. We constructed a random
78900 version of the paranoid model (RANDOM-PARRY) which used PARRY's
79000 output statements but selected them at random, independently of what
79100 the interviewer said. Two psychiatrists conducted interviews with
79200 this model, transcripts of which were paired with patient interviews
79300 and sent to 200 randomly-selected psychiatrists asking both the
79400 machine-question and the dimension-question. Of the 69 replies to
79500 the machine question, 34 (49%) were right and 35 (51%) wrong. Based
79600 on this random sample of 69 psychiatrists, the 95% confidence
79700 interval ranges from 39% to 63%, again indicating chance guessing. When
79800 a poor model, such as a random one, passes a test, it strongly
79900 suggests the test is weak.
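
The idea of a random version can be sketched as a responder that draws from a fixed pool of output statements without consulting the input. This is only an illustration of the principle: the class name RandomResponder and the placeholder statements are hypothetical and are not taken from PARRY or RANDOM-PARRY.

```python
import random

class RandomResponder:
    """Replies with a statement drawn at random, ignoring the input."""

    def __init__(self, statements, seed=None):
        if not statements:
            raise ValueError("at least one output statement is required")
        self.statements = list(statements)
        self.rng = random.Random(seed)

    def respond(self, interviewer_input):
        # The interviewer's input is deliberately unused: the reply is
        # independent of what was said.
        return self.rng.choice(self.statements)

# Placeholder pool (hypothetical; not the model's actual statements).
STATEMENTS = ["STATEMENT A", "STATEMENT B", "STATEMENT C"]

responder = RandomResponder(STATEMENTS, seed=0)
replies = [responder.respond(question)
           for question in ["HOW ARE YOU?", "WHY ARE YOU HERE?"]]
print(replies)
```

Because the reply never depends on the question, any coherence a judge perceives in such a transcript comes from the statements themselves and not from the exchange.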
80000
80100 (INSERT TABLE 5 HERE)
80200
80300 Although a distinction is not made when the simple machine-
80400 question is asked, definite distinctions ARE made when judgements are
80500 requested along specific dimensions. As shown in Table 5,
80600 significant differences appear along the dimensions of linguistic
80700 noncomprehension, thought disorder and bizarreness, with RANDOM-PARRY
80800 rated higher. On these particular dimensions we can construct a
80900 continuum in which the random version represents one extreme, the
81000 actual patients another. Nonrandom PARRY lies somewhere between these
81100 two extremes, indicating that it performs significantly better than
81200 the random version but still requires improvement before it can be
81300 considered indistinguishable from patients relative to these
81400 dimensions. Table 6 presents t values for differences between mean
81500 ratings of PARRY and RANDOM-PARRY. (See Table 6 and Fig. 2 for the
81600 mean ratings.)
81700
81800 (INSERT TABLE 6 AND FIG 2 HERE)
81900
82000 These studies show that a more useful way to use Turing-like
82100 indistinguishability tests is to ask expert judges to make ratings
82200 along multiple dimensions deemed essential to the model. Thus the
82300 model can serve as an instrument for its own perfection. A good
82400 validation procedure has criteria for better or worse approximations.
82500 Useful tests do not necessarily prove a model; they probe it for its
82600 strengths and weaknesses, award it plusses and minuses, and clarify
82700 what is to be done next in the way of modification and repair. Simply
82800 asking the machine-question yields little information relevant to
82900 what the model builder most wants to know, namely, along which
83000 dimensions does the model need to be modified in order to effect an
83100 improvement in its performance?
83200
83300 To conclude, it is perhaps historically significant that
83400 these tests were conducted at all. To my knowledge, no one to date
83500 has subjected an interactive simulation model of human symbolic
83600 processes to multidimensional indistinguishability tests. These tests
83700 set a precedent and provide a standard against which competing models
83800 might be measured.